April 3, 2020
The NY Times has released their state- and county-level data. I will get that data in over the weekend.
April 2, 2020
I started writing this as an example of how to analyze public data in R, which I still think is a great use of shelter-in-place time. But now I’m shifting to thinking this could be a useful way for non-programmers to look at data, so I’ll post a page that is just the figures and my explanations of why I made each figure, without all of the code. Stay tuned for that.
If you want to run this analysis on your computer without retyping anything, grab the .Rmd.
March 30, 2020
Public data on the spread of coronavirus has been compiled by Johns Hopkins CSSE for their excellent dashboard. They post the raw data on GitHub. I’m writing this up because getting data from GitHub into R is pretty easy, but I have to look it up each time.
First, I need to set up my R session by installing and loading packages.
# install.packages("ggplot2")
# install.packages("tidyverse")
# install.packages("gapminder")
# install.packages("gganimate")
# install.packages("gifski")
# install.packages("geofacet")
# install.packages("devtools")
# devtools::install_github("UrbanInstitute/urbnmapr") # had to uninstall the *program Rtools* and download/install the new version to get this to work
# install.packages("transformr")
# install.packages("tweenr")
library(ggplot2)
library(tidyverse)
library(gapminder)
library(gganimate)
library(gifski)
library(geofacet)
library(urbnmapr)
library(transformr)
library(tweenr)
library(sf)
Now let’s fetch the data. Currently it looks like the data is updated every evening around 5 EST. We want the raw data, not the formatted-for-humans version (click the “Raw” button on the right side of the GitHub window).
covid19cases <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv", header=TRUE)
tibble(covid19cases)
## # A tibble: 264 x 84
## Province.State Country.Region Lat Long X1.22.20 X1.23.20 X1.24.20
## <fct> <fct> <dbl> <dbl> <int> <int> <int>
## 1 "" Afghanistan 33 65 0 0 0
## 2 "" Albania 41.2 20.2 0 0 0
## 3 "" Algeria 28.0 1.66 0 0 0
## 4 "" Andorra 42.5 1.52 0 0 0
## 5 "" Angola -11.2 17.9 0 0 0
## 6 "" Antigua and B~ 17.1 -61.8 0 0 0
## 7 "" Argentina -38.4 -63.6 0 0 0
## 8 "" Armenia 40.1 45.0 0 0 0
## 9 "Australian C~ Australia -35.5 149. 0 0 0
## 10 "New South Wa~ Australia -33.9 151. 0 0 0
## # ... with 254 more rows, and 77 more variables: X1.25.20 <int>,
## # X1.26.20 <int>, X1.27.20 <int>, X1.28.20 <int>, X1.29.20 <int>,
## # X1.30.20 <int>, X1.31.20 <int>, X2.1.20 <int>, X2.2.20 <int>,
## # X2.3.20 <int>, X2.4.20 <int>, X2.5.20 <int>, X2.6.20 <int>, X2.7.20 <int>,
## # X2.8.20 <int>, X2.9.20 <int>, X2.10.20 <int>, X2.11.20 <int>,
## # X2.12.20 <int>, X2.13.20 <int>, X2.14.20 <int>, X2.15.20 <int>,
## # X2.16.20 <int>, X2.17.20 <int>, X2.18.20 <int>, X2.19.20 <int>,
## # X2.20.20 <int>, X2.21.20 <int>, X2.22.20 <int>, X2.23.20 <int>,
## # X2.24.20 <int>, X2.25.20 <int>, X2.26.20 <int>, X2.27.20 <int>,
## # X2.28.20 <int>, X2.29.20 <int>, X3.1.20 <int>, X3.2.20 <int>,
## # X3.3.20 <int>, X3.4.20 <int>, X3.5.20 <int>, X3.6.20 <int>, X3.7.20 <int>,
## # X3.8.20 <int>, X3.9.20 <int>, X3.10.20 <int>, X3.11.20 <int>,
## # X3.12.20 <int>, X3.13.20 <int>, X3.14.20 <int>, X3.15.20 <int>,
## # X3.16.20 <int>, X3.17.20 <int>, X3.18.20 <int>, X3.19.20 <int>,
## # X3.20.20 <int>, X3.21.20 <int>, X3.22.20 <int>, X3.23.20 <int>,
## # X3.24.20 <int>, X3.25.20 <int>, X3.26.20 <int>, X3.27.20 <int>,
## # X3.28.20 <int>, X3.29.20 <int>, X3.30.20 <int>, X3.31.20 <int>,
## # X4.1.20 <int>, X4.2.20 <int>, X4.3.20 <int>, X4.4.20 <int>, X4.5.20 <int>,
## # X4.6.20 <int>, X4.7.20 <int>, X4.8.20 <int>, X4.9.20 <int>, X4.10.20 <int>
covid19deaths <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv", header=TRUE)
tibble(covid19deaths)
## # A tibble: 264 x 84
## Province.State Country.Region Lat Long X1.22.20 X1.23.20 X1.24.20
## <fct> <fct> <dbl> <dbl> <int> <int> <int>
## 1 "" Afghanistan 33 65 0 0 0
## 2 "" Albania 41.2 20.2 0 0 0
## 3 "" Algeria 28.0 1.66 0 0 0
## 4 "" Andorra 42.5 1.52 0 0 0
## 5 "" Angola -11.2 17.9 0 0 0
## 6 "" Antigua and B~ 17.1 -61.8 0 0 0
## 7 "" Argentina -38.4 -63.6 0 0 0
## 8 "" Armenia 40.1 45.0 0 0 0
## 9 "Australian C~ Australia -35.5 149. 0 0 0
## 10 "New South Wa~ Australia -33.9 151. 0 0 0
## # ... with 254 more rows, and 77 more variables: X1.25.20 <int>,
## # X1.26.20 <int>, X1.27.20 <int>, X1.28.20 <int>, X1.29.20 <int>,
## # X1.30.20 <int>, X1.31.20 <int>, X2.1.20 <int>, X2.2.20 <int>,
## # X2.3.20 <int>, X2.4.20 <int>, X2.5.20 <int>, X2.6.20 <int>, X2.7.20 <int>,
## # X2.8.20 <int>, X2.9.20 <int>, X2.10.20 <int>, X2.11.20 <int>,
## # X2.12.20 <int>, X2.13.20 <int>, X2.14.20 <int>, X2.15.20 <int>,
## # X2.16.20 <int>, X2.17.20 <int>, X2.18.20 <int>, X2.19.20 <int>,
## # X2.20.20 <int>, X2.21.20 <int>, X2.22.20 <int>, X2.23.20 <int>,
## # X2.24.20 <int>, X2.25.20 <int>, X2.26.20 <int>, X2.27.20 <int>,
## # X2.28.20 <int>, X2.29.20 <int>, X3.1.20 <int>, X3.2.20 <int>,
## # X3.3.20 <int>, X3.4.20 <int>, X3.5.20 <int>, X3.6.20 <int>, X3.7.20 <int>,
## # X3.8.20 <int>, X3.9.20 <int>, X3.10.20 <int>, X3.11.20 <int>,
## # X3.12.20 <int>, X3.13.20 <int>, X3.14.20 <int>, X3.15.20 <int>,
## # X3.16.20 <int>, X3.17.20 <int>, X3.18.20 <int>, X3.19.20 <int>,
## # X3.20.20 <int>, X3.21.20 <int>, X3.22.20 <int>, X3.23.20 <int>,
## # X3.24.20 <int>, X3.25.20 <int>, X3.26.20 <int>, X3.27.20 <int>,
## # X3.28.20 <int>, X3.29.20 <int>, X3.30.20 <int>, X3.31.20 <int>,
## # X4.1.20 <int>, X4.2.20 <int>, X4.3.20 <int>, X4.4.20 <int>, X4.5.20 <int>,
## # X4.6.20 <int>, X4.7.20 <int>, X4.8.20 <int>, X4.9.20 <int>, X4.10.20 <int>
Looking at other people’s data is always interesting, even really nicely organized data like this. Some programs (e.g., SAS) want each timepoint to have its own variable. In R we want tidy data.
We have 4 variables with location data (Province.State, Country.Region, Lat, Long) and a bunch of columns with each day’s data. The date columns all read in with “X” at the beginning of the variable name because R doesn’t like variable names that start with numbers. We can use that to specify how to tidy the data. If you’ve tidied data in R for a while, you might have used melt/cast or gather/spread. The tidyverse has come out with a new pair that is a bit more intuitive: pivot_longer/pivot_wider.
cases <- covid19cases %>%
pivot_longer( cols = starts_with("X"), # I'm specifying the columns that I want to tidy (use "-" to specify the columns you want to leave alone)
names_to = "date", # variable name for the variables column
names_prefix = "X", # let's get rid of the "X" so we can treat dates as dates
values_to = "confirmedCases") # variable name for the column that will hold the values
tibble(cases)
## # A tibble: 21,120 x 6
## Province.State Country.Region Lat Long date confirmedCases
## <fct> <fct> <dbl> <dbl> <chr> <int>
## 1 "" Afghanistan 33 65 1.22.20 0
## 2 "" Afghanistan 33 65 1.23.20 0
## 3 "" Afghanistan 33 65 1.24.20 0
## 4 "" Afghanistan 33 65 1.25.20 0
## 5 "" Afghanistan 33 65 1.26.20 0
## 6 "" Afghanistan 33 65 1.27.20 0
## 7 "" Afghanistan 33 65 1.28.20 0
## 8 "" Afghanistan 33 65 1.29.20 0
## 9 "" Afghanistan 33 65 1.30.20 0
## 10 "" Afghanistan 33 65 1.31.20 0
## # ... with 21,110 more rows
deaths <- covid19deaths %>%
pivot_longer( cols = starts_with("X"),
names_to = "date",
names_prefix = "X",
values_to = "deaths")
tibble(deaths)
## # A tibble: 21,120 x 6
## Province.State Country.Region Lat Long date deaths
## <fct> <fct> <dbl> <dbl> <chr> <int>
## 1 "" Afghanistan 33 65 1.22.20 0
## 2 "" Afghanistan 33 65 1.23.20 0
## 3 "" Afghanistan 33 65 1.24.20 0
## 4 "" Afghanistan 33 65 1.25.20 0
## 5 "" Afghanistan 33 65 1.26.20 0
## 6 "" Afghanistan 33 65 1.27.20 0
## 7 "" Afghanistan 33 65 1.28.20 0
## 8 "" Afghanistan 33 65 1.29.20 0
## 9 "" Afghanistan 33 65 1.30.20 0
## 10 "" Afghanistan 33 65 1.31.20 0
## # ... with 21,110 more rows
Both the cases and deaths data seem to match row by row, but I want to join on a unique key rather than trusting that row 15873 in cases goes with row 15873 in deaths. To join, we need a unique key for each observation. Initially I thought that Lat/Long would provide that, but there were cruise-ship observations with the same Lat/Long but different ships. I figured this out by joining on Lat/Long and seeing that the new dataframe had more observations than the originals; then I looked for duplicates. So we’re going to create a new variable called “join” in both dataframes that simply pastes all of our location variables together.
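The duplicate check itself is quick with dplyr’s count(). A minimal sketch on made-up rows (the real check would run the same way on the cases dataframe):

```r
library(dplyr)

# Made-up rows: two cruise ships reported at the same Lat/Long,
# so Lat + Long + date alone is NOT a unique key
toy <- data.frame(
  Province.State = c("Diamond Princess", "Grand Princess", ""),
  Country.Region = c("Cruise Ship", "Cruise Ship", "Monaco"),
  Lat  = c(0, 0, 43.7),
  Long = c(0, 0, 7.4),
  date = "3.1.20"
)

# Count rows per candidate key; any n > 1 means the key is not unique
dups <- toy %>%
  count(Lat, Long, date) %>%
  filter(n > 1)
nrow(dups)  # 1 duplicated Lat/Long/date combination

# Adding Province.State to the key makes every observation unique
toy %>%
  count(Province.State, Lat, Long, date) %>%
  filter(n > 1) %>%
  nrow()  # 0
```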
cases$join <- paste(cases$Province.State, cases$Country.Region ,cases$Lat, cases$Long, cases$date)
deaths$join <- paste(deaths$Province.State, deaths$Country.Region,deaths$Lat, deaths$Long, deaths$date)
tibble(cases)
## # A tibble: 21,120 x 7
## Province.State Country.Region Lat Long date confirmedCases join
## <fct> <fct> <dbl> <dbl> <chr> <int> <chr>
## 1 "" Afghanistan 33 65 1.22.~ 0 " Afghanista~
## 2 "" Afghanistan 33 65 1.23.~ 0 " Afghanista~
## 3 "" Afghanistan 33 65 1.24.~ 0 " Afghanista~
## 4 "" Afghanistan 33 65 1.25.~ 0 " Afghanista~
## 5 "" Afghanistan 33 65 1.26.~ 0 " Afghanista~
## 6 "" Afghanistan 33 65 1.27.~ 0 " Afghanista~
## 7 "" Afghanistan 33 65 1.28.~ 0 " Afghanista~
## 8 "" Afghanistan 33 65 1.29.~ 0 " Afghanista~
## 9 "" Afghanistan 33 65 1.30.~ 0 " Afghanista~
## 10 "" Afghanistan 33 65 1.31.~ 0 " Afghanista~
## # ... with 21,110 more rows
tibble(deaths)
## # A tibble: 21,120 x 7
## Province.State Country.Region Lat Long date deaths join
## <fct> <fct> <dbl> <dbl> <chr> <int> <chr>
## 1 "" Afghanistan 33 65 1.22.20 0 " Afghanistan 33 65~
## 2 "" Afghanistan 33 65 1.23.20 0 " Afghanistan 33 65~
## 3 "" Afghanistan 33 65 1.24.20 0 " Afghanistan 33 65~
## 4 "" Afghanistan 33 65 1.25.20 0 " Afghanistan 33 65~
## 5 "" Afghanistan 33 65 1.26.20 0 " Afghanistan 33 65~
## 6 "" Afghanistan 33 65 1.27.20 0 " Afghanistan 33 65~
## 7 "" Afghanistan 33 65 1.28.20 0 " Afghanistan 33 65~
## 8 "" Afghanistan 33 65 1.29.20 0 " Afghanistan 33 65~
## 9 "" Afghanistan 33 65 1.30.20 0 " Afghanistan 33 65~
## 10 "" Afghanistan 33 65 1.31.20 0 " Afghanistan 33 65~
## # ... with 21,110 more rows
covid19 <- cases %>%
select(join, confirmedCases) %>%
left_join( deaths, by = "join") %>%
select(Province.State, Country.Region, Lat, Long, date, confirmedCases, deaths)
# R read the dates in as chr, so let's convert them to Date
covid19$date <- as.Date(covid19$date, format = "%m.%d.%y")
# rename "US" and "Korea, South" so they will join with the gapminder country names
covid19$Country.Region <- recode(covid19$Country.Region, "US" = "United States", "Korea, South"= "Korea, Rep." )
Now that the data is tidied and joined, it’s time to explore.
Nearly tidied, anyway: the values in confirmedCases and deaths are the same kind of thing (counts of individuals), so depending on what you want to do with the data, you may want to tidy those further.
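That last tidying step is one more pivot_longer. A minimal sketch on a made-up slice (the real covid19 object would pivot the same way):

```r
library(tidyr)

# Made-up slice of the joined cases/deaths data
mini <- data.frame(
  Country.Region = c("Andorra", "Andorra"),
  date = as.Date(c("2020-03-01", "2020-03-02")),
  confirmedCases = c(1L, 2L),
  deaths = c(0L, 0L)
)

# Fully tidy: one row per country/date/count-type
miniLong <- pivot_longer(mini,
                         cols = c(confirmedCases, deaths),
                         names_to = "countType",
                         values_to = "count")
nrow(miniLong)  # 4: each country/date now has a confirmedCases row and a deaths row
```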
# Plot the cases and deaths over time
ggplot(data=covid19)+
scale_y_continuous(labels = scales::comma)+
geom_bar(aes(x=date, y=confirmedCases), stat="identity")+
geom_bar(aes(x=date, y=deaths), stat="identity", fill="red")+
theme_bw()+
ggtitle("Global cumulative cases")
data("gapminder_unfiltered")
covid19 <- left_join(covid19, gapminder_unfiltered[gapminder_unfiltered$year=="2007",], by = c("Country.Region" = "country"))
todayscovid19 <- covid19 %>%
filter(date==max(date)) %>%
group_by(continent, Country.Region, pop)%>%
summarise(confirmedCases = sum(confirmedCases),
deaths = sum(deaths))
ggplot(todayscovid19, aes(x=confirmedCases*1000/pop, y=deaths*1000/pop, color=continent))+
geom_point(size=3)+
theme_bw()+
scale_color_brewer(palette="Set1")+
geom_text(data=subset(todayscovid19, confirmedCases*1000/pop > 1.50),
aes(x=confirmedCases*1000/pop, y=deaths*1000/pop,label=Country.Region), vjust="inward", hjust="inward", color="Black")+
labs(x="Cases per 1000", y="Deaths per 1000")+
geom_text(data=subset(todayscovid19, Country.Region == "United States"),
aes(x=confirmedCases*1000/pop, y=deaths*1000/pop, label="US"), size=5, color="black")
# geom_text(data=subset(todayscovid19, Country.Region == "India"),
# aes(x=confirmedCases*1000/pop, y=deaths*1000/pop, label="India"), size=5, color="black")
ggsave(file=paste0("PerCapitaCasesDeaths", format(Sys.Date(), "%B%d"), ".jpg"))
### let's see if I can animate by day. This works, but the text is all over the place, so I'm turning that off for now
anim <- ggplot(covid19, aes(x=confirmedCases*1000/pop, y=deaths*1000/pop, color=continent))+
geom_point(size=3)+
theme_bw()+
scale_color_brewer(palette="Set1")+
# geom_text(data=subset(covid19, confirmedCases*1000/pop > .15),
# aes(x=confirmedCases*1000/pop, y=deaths*1000/pop,label=Country.Region), vjust="inward", hjust="inward", color="Black")+
labs(x="Cases per 1000 ", y="Deaths per 1000")+
# geom_text(data=subset(covid19, Country.Region == "United States"),
# aes(x=confirmedCases*1000/pop, y=deaths*1000/pop, label="US"), size=5, color="black")+
transition_time(time=covid19$date)+
ggtitle('Now showing {frame_time}')
animate(anim, renderer = gifski_renderer(), end_pause = 15)
anim_save(filename="casesDeathsPerCapita.gif", animation=last_animation())
I started working on this because Kid1 wanted the numbers for the US and Canada each day. You could chain all of the commands together without making an object of just the US and Canada, but I like to check that I’ve changed the data the way I think I’m changing it.
#check spelling/caps for countries
# levels(as.factor(cases$Country.Region))
usCanada <- covid19 %>%
filter(Country.Region %in% c("United States", "Canada")) %>%
droplevels() %>%
group_by(date, Country.Region) %>%
summarise(confirmedCases = sum(confirmedCases),
deaths = sum(deaths))
Write this daily data to a csv for Kid1 to open and paste into Excel, because he isn’t interested in learning R yet.
write.csv(usCanada, file=paste0("usCanadaDaily", format(Sys.Date(),"%B%d"),".csv"))
I can’t help but make a graph for him as well, because seeing data is fundamental to understanding the numbers.
ggplot(usCanada, aes(x=date, y=confirmedCases, color=Country.Region))+
scale_y_continuous(labels = scales::comma)+
geom_point()+
theme_classic()
Making the data fully tidy allows for other visualizations.
usCanadaLong <- pivot_longer(usCanada, -c(date, Country.Region), names_to="caseType", values_to = "count")
ggplot(usCanadaLong, aes(x=date, y=count, fill=caseType))+
geom_bar(stat="identity")+
scale_y_continuous(labels = scales::comma)+
facet_wrap(vars(Country.Region), scales = "free_y")+
theme_classic()
The lack of testing in the US is a critical concern in stopping the spread of the virus. And it’s personally/professionally interesting because qPCR is something that we do often in [MARS](https://mars.uconn.edu). The [COVID Tracking Project](https://covidtracking.com) is trying to compile all testing data, not just positives, from each state.
# Read in current data
usDaily <- read.csv("https://covidtracking.com/api/states/daily.csv", header=T)
usDaily$date <- as.Date(as.character(usDaily$date), format="%Y%m%d")
# read in US population data
usPop <- read.csv("https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/state/detail/SCPRC-EST2019-18+POP-RES.csv", header=T)
# The COVID Tracking Project uses two-letter state abbreviations; the census spells states out. We need a linker file.
usAbb <- read.csv("https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv", header=T, stringsAsFactors = F)
usPop <- left_join(usPop, usAbb, by=c("NAME" = "State"))
# Puerto Rico isn't listed correctly, just going to fix manually
usPop[usPop$NAME == "Puerto Rico Commonwealth", "Abbreviation"] <- "PR"
# regions and divisions are numeric rather than words
usPop$REGION <- recode(usPop$REGION,
"0" = "US",
"1" = "Northeast",
"2" = "Midwest",
"3" = "South",
"4" = "West",
"X" = "Puerto Rico"
)
usPop$DIVISION <- recode(usPop$DIVISION,
"0" = "US",
"1" = "New England",
"2" = "Middle Atlantic",
"3" = "East North Central",
"4" = "West North Central",
"5" = "South Atlantic",
"6" = "East South Central",
"7" = "West South Central",
"8" = "Mountain",
"9" = "Pacific",
"X" = "Puerto Rico" )
# add Puerto Rico to geofacet
my_gird <- us_state_grid1 %>% add_row(row=6, col=11, code="PR",name="Puerto Rico")
Calculate the positive rate, which is important for determining how prevalent cases are in the community.
usDaily$posRate <- usDaily$positive/usDaily$total
usDailyAll <- left_join(usDaily, usPop, by=c("state" = "Abbreviation")) %>%
mutate(posTotalPerCap = (positive*1000)/POPESTIMATE2019,
negTotalPerCap = (negative*1000)/POPESTIMATE2019,
hospTotalPerCap = (hospitalized*1000)/POPESTIMATE2019,
deathTotalPerCap = (death*1000)/POPESTIMATE2019,
totalTotalPerCap = (total*1000)/POPESTIMATE2019,
deathPerCap = (deathIncrease*1000)/POPESTIMATE2019,
hospitalizedPerCap = (hospitalizedIncrease*1000)/POPESTIMATE2019,
positivePerCap = (positiveIncrease*1000)/POPESTIMATE2019,
testsPerCap = (totalTestResultsIncrease*1000)/POPESTIMATE2019)
usDailyAll$REGION <- droplevels(usDailyAll$REGION)
usDailyAll$DIVISION <- droplevels(usDailyAll$DIVISION)
We’ll tidy the Covid Tracking Project data so we can plot it.
usDailyIncrease <- usDailyAll %>%
select(c(date, state, deathIncrease, hospitalizedIncrease, negativeIncrease, positiveIncrease, totalTestResultsIncrease))
usDaily <- usDailyAll %>%
select(c(date, state, positive, negative, hospitalized, death, posRate, total))
usDailyLong <- pivot_longer(usDaily, -c(date, state), names_to = "testType", values_to = "count")
# Again, I could do this filtering before each plot, or I can make a separate object. I'm the new-object kind of R person.
usDailyLongNoTotal <- usDailyLong %>%
filter(testType %in% c("negative", "pending", "positive", "hospitalized", "death"))
# I want to set the order to display the results
usDailyLongNoTotal$testType <- factor(usDailyLongNoTotal$testType, levels = c("negative", "pending", "positive", "hospitalized", "death"))
# Also, want to set the color for each result type
testType.col <- c("negative" = "grey",
"pending" = "lightgrey",
"positive" = "red",
"hospitalized" = "purple",
"death" = "black")
ggplot(usDailyLongNoTotal)+
geom_bar(stat="identity", aes(x=date, y=count,fill=testType))+
scale_fill_manual(values=testType.col)+
# scale_y_log10()+
theme_classic()+
scale_y_continuous(labels = scales::comma)+
ggtitle("Outcome of tests all of US")
usDailyLongNoTotal%>%
filter(testType %in% c("positive", "hospitalized", "death"))%>%
ggplot()+
geom_bar(stat="identity", aes(x=date, y=count,fill=testType))+
scale_fill_manual(values=testType.col)+
scale_y_continuous(labels = scales::comma)+
theme_classic()+
ggtitle("Positive cases all of US")
Total tests administered by each state.
ggplot(usDaily, aes(x=date, y=total))+
geom_line()+
# facet_wrap(~state)+ #wraping by state name
facet_geo(~state, grid=my_gird)+ #wraping by state location
# scale_y_log10()+
scale_y_continuous(labels = scales::comma)+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
ggtitle("Tests administered by state")
Testing is very patchy by state, so let’s look at the positive rate through time for each state.
ggplot(usDaily, aes(x=date, y=posRate))+
geom_line()+
# facet_wrap(~state)+ #wraping by state name
facet_geo(~state, grid=my_gird)+ #wraping by state location
theme_classic()+
scale_y_continuous(labels = scales::comma)+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
ggtitle("Positive test rate by state")
It’s hard to compare raw numbers of tests because of the different population sizes of each state. So let’s calculate tests and positives per capita as permil (1‰ is 1 in 1000, compared to 1%, which is 1 in 100).
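The conversion behind the *PerCap columns is just count * 1000 / population. With made-up numbers:

```r
# Permil conversion used for all of the per-capita columns
positives <- 3565      # made-up cumulative positives for a state
pop       <- 3565000   # made-up state population
posTotalPerCap <- positives * 1000 / pop
posTotalPerCap  # 1: one case per 1000 residents, i.e. 1 permil (0.1%)
```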
ggplot(usDailyAll, aes(x=date, y=posTotalPerCap))+
geom_line()+
# facet_wrap(~state)+ #wraping by state name
facet_geo(~state, grid=my_gird)+ #wraping by state location
theme_classic()+
scale_y_continuous(labels = scales::comma)+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
ggtitle("Positive tests per 1000 by state")
ggplot(usDailyAll, aes(x=date, y=totalTotalPerCap))+
geom_line()+
# facet_wrap(~state)+ #wraping by state name
facet_geo(~state, grid=my_gird)+ #wraping by state location
theme_classic()+
scale_y_continuous(labels = scales::comma)+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
ggtitle("Total tests per 1000 by state")
ggsave(filename = "perCapitaTotalTests.jpg")
anim <- ggplot(usDailyAll, aes(x=posTotalPerCap, y=deathTotalPerCap, color=REGION))+
geom_point(size=3)+
theme_bw()+
scale_color_brewer(palette="Set1")+
geom_text(data=subset(usDailyAll, posTotalPerCap > 1.5),
aes(x=posTotalPerCap, y=deathTotalPerCap,label=NAME), vjust="inward", hjust="inward", color="Black")+
labs(x="Cases per 1000 ", y="Deaths per 1000")+
transition_time(time=usDailyAll$date)+
ggtitle('Now showing {frame_time}')
animate(anim, renderer = gifski_renderer(), end_pause = 25)
anim_save(filename="casesDeathsPerCapitaUSregion.gif", animation=last_animation())
usToday <- usDailyAll %>%
filter(date==max(date))
ggplot(usToday, aes(x=posTotalPerCap, y=deathTotalPerCap, color=DIVISION))+
geom_point(size=3)+
theme_bw()+
scale_color_brewer(palette="Paired")+
geom_text(data=subset(usToday, posTotalPerCap > 1.5),
aes(x=posTotalPerCap, y=deathTotalPerCap,label=state), vjust="outward", hjust="inward",color="Black")+
labs(x="Cases per 1000 ", y="Deaths per 1000")
ggsave("casesDeathsPerCapita.jpg", width = 8, height=5)
anim <- ggplot(usDailyAll, aes(x=posTotalPerCap, y=deathTotalPerCap, color=DIVISION))+
geom_point(size=3)+
theme_bw()+
scale_color_brewer(palette="Paired")+
# geom_text(data=subset(usDailyAll, posTotalPerCap > 1.5),
# aes(x=posTotalPerCap, y=deathTotalPerCap,label=NAME), vjust="inward", hjust="inward", color="Black")+
labs(x="Cases per 1000 ", y="Deaths per 1000")+
transition_time(time=usDailyAll$date)+
ggtitle('Now showing {frame_time}')
animate(anim, renderer = gifski_renderer(), end_pause = 25)
anim_save(filename="casesDeathsPerCapitaUSregion.gif", animation=last_animation())
It looks like NY’s case fatality rate (CFR) is ~3%, which is close to Wuhan’s rate and less than Italy’s and Spain’s, but higher than many were hoping (the cruise-ship rate was under 1%, which is likely good data because so many passengers were tested). I decided to plot the death rate (CFR) vs. cases per 1000 population. This graph puzzled me when I first was staring at it: initially the per-capita cases were very low but the death rate was very high. I believe that is the signal of undertesting/underdiagnosing, where only the very sick are tested. It suggests that many states are still undertesting, because they still have a CFR greater than NY, which has said that it can’t test mild cases.
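The CFR here is just cumulative deaths over cumulative confirmed cases, and the undertesting signal falls out of the same arithmetic. A sketch with made-up numbers in the range discussed (the ny_* names are invented for the example):

```r
# CFR = cumulative deaths / cumulative confirmed cases
ny_cases  <- 100000  # made-up cumulative confirmed cases
ny_deaths <- 3000    # made-up cumulative deaths
ny_deaths / ny_cases  # 0.03, i.e. ~3%

# Undertesting inflates the apparent CFR: same deaths, but suppose
# only the sickest 20% of true cases were ever tested and confirmed
detected <- ny_cases * 0.2
ny_deaths / detected  # 0.15 -- the missing mild cases make it look worse
```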
anim <- ggplot(usDailyAll, aes(x=posTotalPerCap, y=death/positive, color=DIVISION))+
geom_point(size=3)+
theme_bw()+
scale_color_brewer(palette="Paired")+
# geom_text(data=subset(usDailyAll, posTotalPerCap > 1.5),
# aes(x=posTotalPerCap, y=deathTotalPerCap,label=NAME), vjust="inward", hjust="inward", color="Black")+
labs(x="Cases per 1000 ", y="Case Fatality Rate")+
# geom_text(data=subset(covid19, Country.Region == "United States"),
# aes(x=confirmedCases*1000/pop, y=deaths*1000/pop, label="US"), size=5, color="black")+
transition_time(time=usDailyAll$date)+
ggtitle('Now showing {frame_time}')
animate(anim, renderer = gifski_renderer(), end_pause = 25)
anim_save(filename="CFRtime.gif", animation=last_animation())
We are still undertesting; looking at the increases in tests and positives each day can help demonstrate that.
usDailyIncreaselong<- usDailyIncrease%>%
pivot_longer(-c(date, state), names_to = "testType", values_to = "count")%>%
filter(testType %in% c("deathIncrease", "positiveIncrease", "hospitalizedIncrease"))
usTotalsIncrease <- usDailyIncrease%>%
group_by(date)%>%
summarise(deathIn = sum(deathIncrease),
hospitalizedIn = sum(hospitalizedIncrease),
positiveIn = sum(positiveIncrease),
negativeIn = sum(negativeIncrease),
totalTestIn = sum(totalTestResultsIncrease))
ggplot()+
geom_bar(data=usDailyIncreaselong,stat="identity", aes(x=date, y=count,fill=testType))+
# scale_fill_manual(values=testType.col)+
theme_bw()+
ggtitle("Daily outcome of tests, not cumulative")
ggplot(data=usTotalsIncrease, aes(x=date, y=negativeIn))+
geom_line()+
theme_bw()+
ggtitle("Daily increase in negative tests, not cumulative")
ggplot(usDailyIncrease)+
geom_line(aes(x=date, y=totalTestResultsIncrease))+
# facet_wrap(~state)+ #wraping by state name
facet_geo(~state, grid=my_gird)+ #wraping by state location
theme_classic()+
# scale_y_log10()+
ggtitle("Daily increase in tests")
usIncPop <- left_join(usDailyIncrease, usPop, by=c("state" = "Abbreviation")) %>%
mutate(deathPerCap = (deathIncrease*1000)/POPESTIMATE2019,
hospitalizedPerCap = (hospitalizedIncrease*1000)/POPESTIMATE2019,
positivePerCap = (positiveIncrease*1000)/POPESTIMATE2019,
testsPerCap = (totalTestResultsIncrease*1000)/POPESTIMATE2019)
ggplot(usIncPop, aes(x=date, y=positivePerCap))+
geom_line()+
# facet_wrap(~state)+ #wraping by state name
facet_geo(~state, grid=my_gird)+ #wraping by state location
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
ggtitle("Daily positive test per 1000 by state")
ggsave("DailyPosPerCapitaState.jpg", width=10, height=8)
ggplot(usIncPop, aes(x=date, y=testsPerCap))+
geom_line()+
# facet_wrap(~state)+ #wraping by state name
facet_geo(~state, grid=my_gird)+ #wraping by state location
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
ggtitle("Daily tests per 1000 by state")
ggsave("DailyTestsPerCapitaState.jpg", width=10, height=8)
ggplot(usIncPop, aes(x=date, y=deathPerCap))+
geom_line()+
# facet_wrap(~state)+ #wraping by state name
facet_geo(~state, grid=my_gird)+ #wraping by state location
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
ggtitle("Daily deaths per 1000 by state")
ggsave("DailyDeathsPerCapitaState.jpg", width=10, height=8)
usDailyLongNoTotal %>%
filter(state == "CT") %>%
ggplot()+
geom_bar(stat="identity", aes(x=date, y=count,fill=testType))+
scale_fill_manual(values=testType.col)+
# scale_y_log10()+
theme_classic()+
ggtitle("CT outcome of tests")
# ggsave(file=paste0("ctTestOutCome", format(Sys.Date(), "%B%d"), ".jpg"))
usDaily %>%
filter(state == "CT") %>%
ggplot(aes(x=date, y=posRate))+
geom_line()+
theme_bw()+
ggtitle("CT positive test rate")
There was a big spike in the positive rate on March 18, which means they were only testing people who were likely sick. If we look at total tests administered, there has been a significant increase in testing since then. But our positive test rate (and statements from state public health officials) tells us that we are still undertesting.
usDaily %>%
filter(state == "CT") %>%
ggplot(aes(x=date, y=total))+
geom_line()+
theme_bw()+
ggtitle("CT test administered")
The New York Times has collected and released their county-level data on US cases, so we must plot it. I’m limiting the color scale for the maps of cases per capita because three ski towns in the West have incredibly high cases per capita; the scale doesn’t change after a county hits 10 cases per 1000 residents (1%), even though some are now nearly 2%.
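The code that builds nytcountypop and states didn’t make it into this post. A sketch of how they could be assembled, assuming the NYT us-counties.csv and the county/state shapes from urbnmapr; the county population join (needed for casesPerCap and deathsPerCap) is left as a comment because I’m not certain which census file I’d use:

```r
library(dplyr)
library(urbnmapr)

# NYT county-level time series: date, county, state, fips, cases, deaths
nytcounty <- read.csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv",
                      header = TRUE, stringsAsFactors = FALSE)

# County and state polygons (long, lat, group, county_fips) for geom_polygon()
counties <- urbnmapr::counties
states   <- urbnmapr::states

# Join case counts to county shapes by FIPS (pad NYT's integer fips to 5 digits)
nytcountypop <- nytcounty %>%
  mutate(county_fips = sprintf("%05d", fips)) %>%
  left_join(counties, by = "county_fips")

# A county population table would be joined the same way to add
# casesPerCap = cases * 1000 / pop and deathsPerCap = deaths * 1000 / pop
```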
nytcountypop %>%
filter(date==max(date)) %>%
ggplot() +
geom_boxplot(aes(x=state, y=casesPerCap))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p <- nytcountypop %>%
filter(date==max(date)) %>%
ggplot(aes(long, lat, group=group))
p+ geom_polygon(aes(fill=casesPerCap))+
geom_polygon(data = states, mapping = aes(long, lat, group = group), fill = NA, color = "light grey") +
scale_fill_viridis_c(option = "inferno",direction = -1, limits = c(0, 12))+
coord_map(projection = "albers", lat0 = 39, lat1=45)+
labs(fill="Cases per capita")+
theme_bw()+
theme(legend.position = "bottom")+
ggtitle(paste("Cumulative cases per 1000 residents of county", max(nytcountypop$date)))
ggsave("MapCasesPerCapitaCounty.jpg", width = 12, height=12)
p+ geom_polygon(aes(fill=deathsPerCap))+
geom_polygon(data = states, mapping = aes(long, lat, group = group), fill = NA, color = "light grey") +
scale_fill_viridis_c(option = "magma",direction = -1)+
coord_map(projection = "albers", lat0 = 39, lat1=45)+
labs(fill="Deaths per capita")+
theme_bw()+
theme(legend.position = "bottom")+
ggtitle(paste("Cumulative deaths per 1000 residents of county", max(nytcountypop$date)))
ggsave("MapDeathsPerCapitaCounty.jpg", width = 12, height=12)
p+ geom_polygon(aes(fill=deaths/cases))+
geom_polygon(data = states, mapping = aes(long, lat, group = group), fill = NA, color = "light grey") +
scale_fill_viridis_c(option = "viridis",direction = -1)+
coord_map(projection = "albers", lat0 = 39, lat1=45)+
labs(fill="Case Fatality Rate")+
theme_bw()+
theme(legend.position = "bottom")+
ggtitle(paste("Cumulative case fatality rate by county", max(nytcountypop$date)))
ggsave("MapDeathsPerCaseCounty.jpg", width = 12, height=12)
Looking at data from other countries that have better testing and are further into the pandemic than the US, the true case fatality rate seems to be between 0.5 and 3%. We can use this to map the areas of the country that are still undertesting. We’ll use a cutoff of 5%, which is much worse than we think is true.
p+ geom_polygon(aes(fill=deaths/cases))+
geom_polygon(data = states, mapping = aes(long, lat, group = group), fill = NA, color = "light grey") +
scale_fill_viridis_c(option = "viridis",direction = -1, limits = c(0,.05))+
coord_map(projection = "albers", lat0 = 39, lat1=45)+
labs(fill="Case Fatality Rate")+
theme_bw()+
theme(legend.position = "bottom")+
ggtitle(paste("Counties that are undertesting using cumulative case fatality rate ", max(nytcountypop$date)))
ggsave("MapDeathsPerCaseCounty.jpg", width = 12, height=12)
anim <- nytcountypop %>%
filter(!is.na(casesPerCap)) %>%
ggplot(aes(long, lat, group= fips))+
geom_polygon(aes(fill=casesPerCap))+
geom_polygon(data = states, mapping = aes(long, lat, group = group), fill = NA, color = "light grey") +
scale_fill_viridis_c(option = "inferno",direction = -1)+
coord_map(projection = "albers", lat0 = 39, lat1=45)+
labs(fill="Cases per capita")+
theme_bw()+
theme(legend.position = "bottom")+
transition_time(time=date)+
ggtitle('Now showing {frame_time}')
animate(anim, renderer = gifski_renderer(), end_pause = 20, fps = 5)
anim_save(filename="mapCasesPerCapCounty.gif", animation=last_animation())
anim <- nytcountypop %>%
filter(!is.na(deathsPerCap)) %>%
ggplot(aes(long, lat, group=group))+
geom_polygon(aes(fill=deathsPerCap))+
geom_polygon(data = states, mapping = aes(long, lat, group = group), fill = NA, color = "light grey") +
scale_fill_viridis_c(option = "magma",direction = -1)+
coord_map(projection = "albers", lat0 = 39, lat1=45)+
labs(fill="Deaths per capita")+
theme_bw()+
theme(legend.position = "bottom")+
transition_time(time=date)+
ggtitle('Now showing {frame_time}')
animate(anim, renderer = gifski_renderer(), end_pause = 20, fps = 5)
anim_save(filename="mapDeathsPerCapCounty.gif", animation=last_animation())
nytcounty$pastCasesDate <- as.Date(as.Date(nytcounty$date) - 21)
# calculate past cases based on deaths, using 1.1% (Diamond Princess), 3% (China), and 7% (Spain/Italy) CFR
nytcounty$pastCasesCFR01 <- nytcounty$deaths/0.011
nytcounty$pastCasesCFR03 <- nytcounty$deaths/0.03
nytcounty$pastCasesCFR07 <- nytcounty$deaths/0.07
ggplot(data = nytcounty) +
geom_line(aes(x=as.Date(pastCasesDate), y=pastCasesCFR01), color = "green")+
geom_line(aes(x=as.Date(pastCasesDate), y=pastCasesCFR03), color = "orange")+
geom_line(aes(x=as.Date(pastCasesDate), y=pastCasesCFR07), color = "red")+
geom_line(aes(x = as.Date(date), y = cases), color = "black")+
facet_geo(~state, grid=my_gird, scales="free_y")+ #wraping by state location
scale_y_continuous(labels = scales::comma)+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))+
ggtitle("Calculated cases based on deaths at 3 CFRs: 1.1% green, 3% orange, 7% red")
## Note: You provided a user-specified grid. If this is a generally-useful
## grid, please consider submitting it to become a part of the geofacet
## package. You can do this easily by calling:
## grid_submit(__grid_df_name__)
## Some values in the specified facet_geo column 'state' do not match the
## 'name' column of the specified grid and will be removed: Virgin
## Islands, Guam, Northern Mariana Islands
ggsave("estCasesByState.jpg", width = 20, height = 12)
This map/graph shows official, detected cases in black. Then I used deaths to calculate cases in the past using three different case fatality rates (1.1% Diamond Princess, 3% China, 7% Spain/Italy). Since it takes ~3 weeks for people to die from Covid-19, we can calculate back: for every death today, there were roughly 90 or 33 or 14 cases 3 weeks ago, depending on which CFR we use. I’m sure there is a much better way to display this.
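The back-calculation is only two steps: divide a death count by an assumed CFR, and shift the implied cases ~3 weeks into the past. With made-up numbers:

```r
# Implied past cases from today's deaths under an assumed CFR
est_cases <- function(deaths, cfr) deaths / cfr

deaths_today <- 50              # made-up county death count
est_cases(deaths_today, 0.011)  # ~4545 implied cases ~3 weeks ago at 1.1% CFR
est_cases(deaths_today, 0.03)   # ~1667 at 3%
est_cases(deaths_today, 0.07)   # ~714 at 7%

# ...and the date those implied cases belong to
as.Date("2020-04-03") - 21      # 2020-03-13
```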